Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix syncing the aggregated snapshot size between CNSVolumeInfo and CNS query result #3073

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

deepakkinni
Copy link
Contributor

@deepakkinni deepakkinni commented Sep 30, 2024

What this PR does / why we need it:
Fix syncing the aggregated snapshot size between CNSVolumeInfo and CNS query result

General Changes:

  1. Reduces logging. I observed a perf bug with a lot of log spew. Changed it to debug.
  2. Improved logging for quota. Changed the display to readable values instead of bytes.

Bug Fixes:

  1. Fixed bugs that synced CNSVolumeInfo with CNS query result:
  2. validateAndCorrectVolumeInfoSnapshotDetails should not error out, this will prevent other parts of full sync from running
  3. If there are failures we proceed to syncing other volumes rather than erroring out
  4. CNS always returned value in Mb, while CNSVolumeInfo always held in bytes, there was always a mismatch and there would always be patching, this was incorrect.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:

Changed logging for easier reading

{"level":"info","time":"2024-09-30T20:09:17.190657406Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"9c668ef6-c4f5-4db6-8229-ab582a3b62f3"}
{"level":"info","time":"2024-09-30T20:09:17.198981664Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"751d9b7f-69bd-4288-9b37-3d601b93de8d"}
{"level":"info","time":"2024-09-30T20:09:17.206634972Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 108Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"9f4df210-aabe-443a-9461-3fce41546aed"}

Deployed changes on live setup and ensured that CNSVolumeInfo was being fixed appropriately.

{"level":"info","time":"2024-09-30T20:09:17.072054174Z","caller":"cnsvolumeinfo/cnsvolumeinfoservice.go:304","msg":"attempt: 1, Successfully patched the cnsvolumeinfo \"57268eaa-adff-4c9b-a466-62b43b9d21da\" namespace \"vmware-system-csi\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072079011Z","caller":"syncer/fullsync.go:841","msg":"Updated CNSvolumeInfo with Snapshot details successfully for volume 57268eaa-adff-4c9b-a466-62b43b9d21da","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072179479Z","caller":"syncer/fullsync.go:799","msg":"Aggregated Snapshot size mismatch for volume 96210bac-bc06-4229-9a35-6783e4fce39a, 0 in CnsVolumeInfo and 108MB in CNS","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072189017Z","caller":"syncer/fullsync.go:811","msg":"Update aggregatedSnapshotCapacity for volume 96210bac-bc06-4229-9a35-6783e4fce39a to 108MB","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072197192Z","caller":"syncer/fullsync.go:820","msg":"snapshot operation completion time unavailable for volumeID 96210bac-bc06-4229-9a35-6783e4fce39a, will use current time 2024-09-30 20:09:17.072194117 +0000 UTC m=+1.245091382 instead","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.072204676Z","caller":"common/util.go:459","msg":"retrieved aggregated snapshot capacity 108 for volume \"96210bac-bc06-4229-9a35-6783e4fce39a\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077780548Z","caller":"cnsvolumeinfo/cnsvolumeinfoservice.go:304","msg":"attempt: 1, Successfully patched the cnsvolumeinfo \"96210bac-bc06-4229-9a35-6783e4fce39a\" namespace \"vmware-system-csi\"","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077791137Z","caller":"syncer/metadatasyncer.go:3518","msg":"cnsVolumeInfoCRUpdated: aggregated snapshot size increased by 144Mi, increased Used field for storagepolicyusage CR: vsan-default-storage-policy-snapshot-usage","TraceId":"eccd165f-ff82-4559-b603-b7ccc5cb1891"}
{"level":"info","time":"2024-09-30T20:09:17.077801948Z","caller":"syncer/fullsync.go:841","msg":"Updated CNSvolumeInfo with Snapshot details successfully for volume 96210bac-bc06-4229-9a35-6783e4fce39a","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}

Verified that volumes that were not in sync was determined correctly

{"level":"info","time":"2024-09-30T20:09:17.077810253Z","caller":"syncer/fullsync.go:847","msg":"Number of volumes with synced aggregated snapshot size with CNS 957","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}
{"level":"info","time":"2024-09-30T20:09:17.077818228Z","caller":"syncer/fullsync.go:848","msg":"Number of volumes with out-of-sync aggregated snapshot size with CNS 67","TraceId":"f42ac652-413b-4429-b818-91f91fcef2af"}

Incorrect SPU before fix:

apiVersion: cns.vmware.com/v1alpha2
kind: StoragePolicyUsage
metadata:
  creationTimestamp: "2024-09-26T22:53:32Z"
  generation: 1
  name: vsan-default-storage-policy-snapshot-usage
  namespace: ns1
  resourceVersion: "3726520"
  uid: 66340f39-712b-4300-869d-cae84edbb5f4
spec:
  resourceApiGroup: snapshot.storage.k8s.io
  resourceExtensionName: snapshot.cns.vsphere.vmware.com
  resourceKind: VolumeSnapshot
  storageClassName: vsan-default-storage-policy
  storagePolicyId: aa6d5a82-1c88-45da-85d3-3d74b91a5bad
status:
  quotaUsage:
    reserved: "0"
    used: "0"

Corrected SPU after fix:

apiVersion: cns.vmware.com/v1alpha2
kind: StoragePolicyUsage
metadata:
  creationTimestamp: "2024-09-26T22:53:32Z"
  generation: 1
  name: vsan-default-storage-policy-snapshot-usage
  namespace: ns1
  resourceVersion: "3752696"
  uid: 66340f39-712b-4300-869d-cae84edbb5f4
spec:
  resourceApiGroup: snapshot.storage.k8s.io
  resourceExtensionName: snapshot.cns.vsphere.vmware.com
  resourceKind: VolumeSnapshot
  storageClassName: vsan-default-storage-policy
  storagePolicyId: aa6d5a82-1c88-45da-85d3-3d74b91a5bad
status:
  quotaUsage:
    reserved: "0"
    used: 8172Mi

Verified that next cycle of full sync detects that the usage is correct

{"level":"info","time":"2024-09-30T20:39:16.709799502Z","caller":"syncer/fullsync.go:847","msg":"Number of volumes with synced aggregated snapshot size with CNS 1024","TraceId":"92e8af49-9cd0-48ca-a95c-6cee81cac6b6"}
{"level":"info","time":"2024-09-30T20:39:16.709836211Z","caller":"syncer/fullsync.go:848","msg":"Number of volumes with out-of-sync aggregated snapshot size with CNS 0","TraceId":"92e8af49-9cd0-48ca-a95c-6cee81cac6b6"}
{

Special notes for your reviewer:

Release note:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 30, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepakkinni

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 30, 2024
@@ -502,7 +502,7 @@ func getFSEnabledClustersWithPriv(ctx context.Context, vc *cnsvsphere.VirtualCen
log.Debugf("vSAN file service is enabled for cluster: %+v and vCenter: %q.",
cluster, vc.Config.Host)
} else {
log.Infof("vSAN file service is disabled for cluster: %+v and vCenter: %q.",
log.Debugf("vSAN file service is disabled for cluster: %+v and vCenter: %q.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these messages were intentionally made Info messages as part of RCCA of SR. Please confirm with Divyen and keep required messages as Info. Most of other Debug messages look ok to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can refer to https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/2490/files and skip updating messages from this file.

@vdkotkar
Copy link
Contributor

vdkotkar commented Oct 1, 2024

Overall changes are looking ok to me. When you revert some required messages to info level, will approve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants